Automating Fault Tolerance in High-Performance Computational Biological Jobs Using Multi-Agent Approaches

Authors

  • Blesson Varghese
  • Gerard T. McKee
  • Vassil Alexandrov
Abstract

BACKGROUND: Large-scale biological jobs on high-performance computing systems require manual intervention if one or more of the computing cores on which they execute fail. This imposes not only a maintenance cost on the job, but also the cost of the time taken to reinstate the job and the risk of losing the data and execution progress accumulated before the failure. Approaches that can proactively detect computing core failures and relocate the failing core's job onto reliable cores are a significant step towards automating fault tolerance.

METHOD: This paper describes an experimental investigation into the use of multi-agent approaches for fault tolerance. Two approaches are studied, the first at the job level and the second at the core level. The approaches are investigated for single core failure scenarios that can occur in the execution of parallel reduction algorithms on computer clusters. A third approach is proposed that incorporates multi-agent technology at both the job and core levels. Experiments are pursued in the context of genome searching, a popular computational biology application.

RESULT: The key conclusion is that the proposed approaches are feasible for automating fault tolerance in high-performance computing systems with minimal human intervention. In a typical experiment in which fault tolerance is studied, centralised and decentralised checkpointing approaches add on average 90% to the actual time for executing the job. In the same experiment, the multi-agent approaches add only 10% to the overall execution time.
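To make the two overhead figures concrete, the following minimal sketch applies them to a hypothetical one-hour job. The baseline duration, the function name `total_time`, and the assumption that both percentages are measured against the same failure-free execution time are illustrative choices, not details taken from the paper.

```python
def total_time(base_hours: float, overhead_fraction: float) -> float:
    """Return the execution time after adding a fractional overhead."""
    return base_hours * (1.0 + overhead_fraction)


# Hypothetical failure-free job duration (an assumption, not a reported value).
BASE_HOURS = 1.0

# Overheads quoted in the abstract: 90% for checkpointing, 10% for multi-agent.
checkpointing_hours = total_time(BASE_HOURS, 0.90)
multi_agent_hours = total_time(BASE_HOURS, 0.10)

print(f"checkpointing: {checkpointing_hours:.2f} h")
print(f"multi-agent:   {multi_agent_hours:.2f} h")
```

Under these assumptions, the checkpointing run takes nearly twice the baseline, while the multi-agent run stays close to it.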


Similar resources

Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid

Grid computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Moreover, fault tolerance and task scheduling are important issues for large-scale computational grids because of the unreliable nature of grid resources. A commonly exploited technique to realize fault tolerance is periodic checkpointing that periodically ...


A Replicated Agent-Based Approach for Implementing a Reliable Mobile Code Paradigm

Using mobile agents, it is possible to bring the code close to the resources, which is not foreseen by the traditional client/server paradigm. Compared to the client/server computing paradigm, the greater flexibility of the mobile agent paradigm comes at additional costs as well as the additional complexity of developing and managing mobile agent-based applications. Such complexity ...


A Survey on Fault Tolerant Multi Agent System

A multi-agent system (MAS) is formed by a number of agents connected together to achieve the desired goals specified by the design. Usually in a multi-agent system, agents work on behalf of a user to accomplish given goals. In MAS, co-ordination, co-operation, negotiation and communication are important aspects to achieve fault tolerance. The multi-agent system is likely to fail in a dis...


Designing General, Composable, and Middleware-independent Grid Infrastructure Tools for Multi-tiered Job Management

We propose a multi-tiered architecture for middleware-independent Grid job management. The architecture consists of a number of services for well-defined tasks in the job management process, offering complete user-level isolation of service capabilities, multiple layers of abstraction, control, and fault tolerance. The middleware abstraction layer comprises components for targeted job submissio...


Monitoring and steering Grid applications with GRID superscalar

We present the design and implementation of a general task monitoring and steering system for Grid applications (GSTAT). The system is integrated in the GRID superscalar (GRIDSs) programming framework. Information at the application, Grid node, and individual task levels is supplied upon request. Using the steering capabilities, individual tasks or the whole application can be cancelled. The co...



Journal:
  • Computers in Biology and Medicine

Volume 48

Publication date: 2014